1. INTRODUCTION

Nowadays, supervisory control and data acquisition (SCADA) systems are widely used in power
transformer stations and industrial factories. To keep such systems operating normally, data collection,
data processing, and ensuring the integrity of the data are very important. SCADA systems can be attacked
not only on the physical infrastructure but also on the communication and supervisory control layers, as
shown in Figure 1 [1]. A1, A2, and A3 are attack points aimed at the supervisory control layer; through
applications on a web server, they spread viruses that destroy the configuration of the control and
supervising network. The attack at point A4 seizes access to the communication channels between the control
center and stations. A5 and A6 are attack points aimed at the communication link between the master terminal
unit (MTU) and the programmable logic controller/remote terminal unit (PLC/RTU). A7 is an attack point on
the network connection between factories and their contractors. The attack at point A8 aims at field
terminal devices. The attacks at points A9 and A10 aim at the signal lines from controllers to actuators
and the feedback signals from sensors to controllers, respectively. Attacks at point A0 are direct
mechanical impacts on the physical-layer devices of SCADA systems. To exploit SCADA protocol weaknesses,
attackers usually use four general types of attack: interception, interruption, modification, and
fabrication [2], [3]. They can target the network infrastructure, the RTU/PLC, and the human-machine
interface (HMI) of SCADA systems. Therefore, data safety studies for industrial control systems are of great
interest. There are two main research directions: the first studies new attack types to test the ability of
information security methods, and the second focuses on building methods to detect intrusions.

In the first research direction, attack methods can be classified as denial of service (DoS) attacks
and data integrity attacks (between layers, or within each layer of a control system) such as
falsifying information or inserting fake information [5]. The second research direction, information
security, is currently receiving much attention [6]. For the data intrusion detection problem, traditional
machine learning approaches [7]-[11] and deep neural network architectures (for big data problems) [12]-[16]
are widely used. In addition, many attempts have been made to build datasets for SCADA intrusion
detection [14], [17]-[19].

Figure 1. Possible attack points to SCADA systems [4]

A standard simulation dataset of data falsification attacks in industrial networks (the gas pipeline
dataset) was released by Mississippi State University's in-house SCADA lab in 2015 [17]. Using this dataset,
many methods have been proposed to improve intrusion detection. In [20], a random forest (RF) model was used
to examine three data-split strategies for classifying categories of SCADA attacks; the 70-30 split strategy
improved accuracy by 5-17% for the classification of reading response attacks and by 2-8% for write command
attacks. A hybrid model consisting of a GoogLeNet neural network and a long short-term memory neural network
(GoogLeNet-LSTM) was proposed in [21] for intrusion detection in industrial control systems; its accuracy
reached 97.56%. In [3], two separate models (support vector machine (SVM) and RF) were trained and
evaluated; the RF model reached accuracies of 99.58% and 99.41% for the binary and categorical
classification tasks, respectively. An ensemble of stacked autoencoder (SAE) models proposed in [22]
achieved an accuracy of 95.86% and an f1-score of 93.83%. The authors of [23] presented an LSTM model
with an accuracy of 92% and an f1-score of 85%. Different classification methods (including K-means,
Naive Bayes (NB), principal component analysis-singular value decomposition (PCA-SVD), and Gaussian mixture
model (GMM)) were examined for intrusion detection on the gas pipeline dataset [24]. Among them, the K-means
model showed the best accuracy, 83.19%. Several machine learning models (HoeffdingTree, NaiveBayes,
RandomTree, BayesNet, and OneR) were investigated together with cost-sensitive learning and Fisher's
(linear) discriminant analysis (FDA) [25]. The random tree algorithm combined with cost-sensitive learning
showed the best prediction performance, with an accuracy of 97.8%.

This paper proposes a stacking ensemble model that combines three single models to detect data
intrusion in SCADA systems. The gas pipeline dataset is used to validate the proposed model. The
rest of the paper is organized as follows: the second section presents the dataset and
methodology; experiments and results are shown in the third section; finally, conclusions and future
work are presented in section 4.

2. DATA SET AND METHODOLOGY
2.1. Gas pipeline dataset 

The gas pipeline dataset consists of 274,628 instances, and each instance contains 17 features. These
instances describe the state and parameters of Modbus frames in a gas pipeline SCADA system, with three
different types of labels showing the state of the network. The 17 features of each instance comprise the
network information (Address, CRC, C/R, ...) and the payload information shown in Table 1. The network
status information is divided into three groups: binary results, categorical results, and specific results.
In this work, only the binary and categorical results are used. The binary results contain two states:
Normal and Attacked. The categorical results consist of the normal state and seven attack types, listed in
Table 2. The dataset is heavily imbalanced: Normal instances account for 78.1% and attacked instances for
21.9% of the total cases.


Table 1. First three rows of the raw dataset with 17 features and 2 label types

Address  Function  Length  Payload                       CRC    C/R  Timestamp          Binary result  Categorical result
4        3         16      ?,?,?,?,?,?,?,?,?,?,?         12869  1    1418682163.170388  0              0
4        3         46      ?,?,?,?,?,?,?,?,?,?,0.689655  12356  0    1418682163.269946  0              0
4        16        90      10,115,0.2,0.5,1,0,0,1,0,0,?  17219  1    1418682164.995590  0              0


Table 2. Description and category of the attacks

Attack description                     Threat type               Abbreviation
Normal                                 N/A                       Normal (0)
Naive malicious response injection     Modification/Fabrication  NMRI (1)
Complex malicious response injection   Modification/Fabrication  CMRI (2)
Malicious state command injection      Modification/Fabrication  MSCI (3)
Malicious parameter command injection  Modification/Fabrication  MPCI (4)
Malicious function code injection      Modification/Fabrication  MFCI (5)
DoS                                    Interruption              DoS (6)
Reconnaissance                         Interception              Recon (7)


2.2. Data pre-processing 

As seen in Table 1, many payload features are not available, which makes it impossible to train a
classifier on these data directly, since machine learning models usually require fixed-size inputs. Missing
data commonly occurs in machine learning, and many approaches can be used to handle this problem. In this
research, we used the "keep prior values" strategy, one of the four methods demonstrated in [3], to
impute all missing values of the dataset. With this strategy, each missing value in a row is filled with
the value from the nearest row above/below it in which that value is present (Table 3). After handling
missing data, the dataset is divided into a training set and a testing set (80% and 20%, respectively) to
train and validate our classification method. Both sets have the same attack/normal ratio. Min-max
normalization is then applied to the datasets.
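
As an illustration of the imputation and normalization steps, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that "keep prior values" works column by column: a gap takes the nearest value above it, and a gap with nothing above it takes the first value below. The helper names `keep_prior_values` and `min_max` are hypothetical. Applied to the payload of the three rows in Table 1, it reproduces the payload shown in Table 3.

```python
from typing import List, Optional


def keep_prior_values(column: List[Optional[float]]) -> List[Optional[float]]:
    """Fill gaps (None) with the nearest value above; leading gaps use the first value below."""
    filled = list(column)
    # Forward pass: carry the last observed value downwards.
    last = None
    for i, value in enumerate(filled):
        if value is None:
            filled[i] = last
        else:
            last = value
    # Backward pass: gaps that had nothing above them take the first value below.
    following = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = following
        else:
            following = filled[i]
    return filled


def min_max(column: List[float]) -> List[float]:
    """Scale a column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Payload of the three rows in Table 1, with '?' written as None.
rows = [
    [None] * 11,
    [None] * 10 + [0.689655],
    [10, 115, 0.2, 0.5, 1, 0, 0, 1, 0, 0, None],
]

# Impute each payload column independently, then rebuild the rows.
columns = [keep_prior_values(list(col)) for col in zip(*rows)]
imputed_rows = [list(row) for row in zip(*columns)]

# Min-max normalization of the Length column of the same three rows.
scaled_length = min_max([16, 46, 90])
```

In a full pipeline the minimum and maximum would normally be taken from the training set only and then reused for the testing set; that detail is an assumption here, since the text does not specify it.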


Table 3. First three rows of the raw dataset after using the "keep prior values" imputation strategy

Address  Function  Length  Payload                              CRC    C/R  Timestamp          Binary result  Categorical result
4        3         16      10,115,0.2,0.5,1,0,0,1,0,0,0.689655  12869  1    1418682163.170388  0              0
4        3         46      10,115,0.2,0.5,1,0,0,1,0,0,0.689655  12356  0    1418682163.269946  0              0
4        16        90      10,115,0.2,0.5,1,0,0,1,0,0,0.689655  17219  1    1418682164.995590  0              0


2.3. Machine learning techniques 
2.3.1. Classification metrics 

The most ubiquitous metric for classification tasks is accuracy, which is the ratio of correct
predictions to all predictions. As described earlier, the gas pipeline dataset is heavily imbalanced,
which makes accuracy an ineffective metric on its own. For example, if all instances of this dataset are
predicted as 0 (normal state), the accuracy is 78.1% (equal to the proportion of the normal state), which
looks considerably high for a classification task even though the model is useless. The problem is even
worse in intrusion detection, where classifying an attack as normal is more severe than misclassifying a
normal state as an attack. Because of that, metrics other than accuracy are needed to evaluate a
classification model. The definitions and formulas of some standard metrics, which are prevalent in
imbalanced dataset classification tasks, are presented in (1) to (4). The accuracy score is defined as
in (1):


$$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \times 100\% \tag{1}$$

Precision is defined as the ratio of correctly predicted positive observations to the total predicted
positive observations, as shown in (2):

$$\text{precision} = \frac{TP}{TP + FP} \tag{2}$$

Recall is defined as the ratio of correctly predicted positive observations to all observations in class
0 (Normal). It is given in (3):

$$\text{recall} = \frac{TP}{TP + FN} \tag{3}$$

The f1-score is the harmonic mean of precision and recall, as shown in (4):

$$\text{f1-score} = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}} \tag{4}$$



Here, true positives (TP) is the number of instances in class 0 (normal) that the model correctly predicts;
true negatives (TN) is the number of instances in class 1 (attack) that the model correctly predicts; false
positives (FP) is the number of attack instances that the model wrongly assigns to class 0 (normal); and
false negatives (FN) is the number of normal instances that the model wrongly assigns to class 1 (attack).

The values of precision, recall, and f1-score are non-negative and do not exceed one. A high
precision means that most instances predicted as positive are indeed positive. A high recall means that
the true positive rate is high (the rate of misclassifying positive instances is low). The f1-score takes
both false positives and false negatives into account and is used to seek a balance between precision and
recall, which is significant in evaluating an imbalanced dataset.
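
To make the accuracy pitfall concrete, the sketch below evaluates formulas (1) to (4) for a model that labels every instance as normal. The sample of 1000 instances (781 normal, 219 attack) is hypothetical and only mirrors the class ratio reported for the dataset; the function name `classification_metrics` is likewise an assumption. The same predictions are scored twice: once with class 0 (normal) as the positive class, as in the definitions above, and once with the attack class as positive.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and f1-score from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical sample: 781 normal and 219 attack instances, all predicted as normal.

# Positive class = normal (class 0): every normal instance is a true positive,
# every attack instance is a false positive.
normal_as_positive = classification_metrics(tp=781, fp=219, fn=0, tn=0)

# Positive class = attack (class 1): every attack instance is a false negative,
# every normal instance is a true negative.
attack_as_positive = classification_metrics(tp=0, fp=0, fn=219, tn=781)
```

Accuracy is 78.1% in both cases. With the normal class as positive, recall is 1 and the f1-score is about 0.88, while with the attack class as positive, precision, recall, and f1-score are all 0. This is why per-class scores are more informative than accuracy alone on this dataset.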


2.3.2. Classification models 

In this work, we choose four models to construct our ensemble for the intrusion detection task: RF [26],
light gradient boosting machine (LGBM) [27], eXtreme gradient boosting (XGBoost) [28], and multilayer
perceptron (MLP). RF, LGBM, and XGBoost are all tree-based models, chosen for their fast training and
their effectiveness in classification tasks. MLP is used to make the final decision.

In machine learning, various types of models can be used for classification tasks, and every model has
its own merits and defects. A given model can perform well in one situation but poorly in others.
Therefore, to combine the advantages of individual models, a family of approaches called ensemble learning
was developed. Ensemble learning combines multiple models to build a stronger one for a particular problem.
Some popular ensemble algorithms are boosting, bagging, and stacking. Bagging (short for bootstrap
aggregating) decreases the variance of the prediction. It generates several training sets from the original
dataset by sampling with replacement, trains one model on each, and aggregates their predictions, which
makes the final prediction more reliable. Boosting first uses subsets of the original data to produce a
series of moderately performing models and then "boosts" their performance by combining them, for example
through a weighted vote. Unlike bagging, in classical boosting the subset creation is not random: it
depends on the performance of previous models, and every new subset emphasizes the elements that previous
models misclassified. In stacking, several models are first applied to the original data. Then a meta-level
classifier, which takes the outputs of every first-level model as its input, makes the final decision.
Among our first-level models, RF is a bagging model, while LGBM and XGBoost are boosting models. All three
use multiple decision tree classifiers to generate the final prediction.

In this work, we propose an ensemble model that uses a stacking strategy to combine three
different classifiers. The structure of our stacking model is presented in Figure 2. The best
hyper-parameter set of each first-level model is chosen using cross-validation (CV) and random search. Once
the hyper-parameters of each first-level model are selected, these classifiers are trained on the whole
training dataset. The outputs of these models are then used to train a multilayer perceptron (MLP) neural
network as the meta-classifier of the stacking model.
Figure 2. The architecture of the stacking model
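
The stacking structure described above can be sketched with scikit-learn's `StackingClassifier`. This is an illustrative sketch under several assumptions, not the authors' code: the data are synthetic (17 features and roughly a 78/22 class ratio, to mimic the dataset), two `GradientBoostingClassifier` instances stand in for LGBM and XGBoost, and all hyper-parameter values except the 24 hidden neurons reported later for the binary task are arbitrary. Note also that scikit-learn trains the meta-classifier on out-of-fold predictions of the first-level models, which may differ from the procedure used in this work.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the pre-processed dataset (assumption).
X, y = make_classification(
    n_samples=600, n_features=17, n_informative=8,
    weights=[0.78, 0.22], random_state=0,
)
# Stratified 80/20 split keeps the same class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# First level: one bagging model and two boosting models.
first_level = [
    ("rf", RandomForestClassifier(n_estimators=30, random_state=0)),
    # Stand-in for LGBM (assumption).
    ("boost_a", GradientBoostingClassifier(n_estimators=30, random_state=0)),
    # Stand-in for XGBoost (assumption).
    ("boost_b", GradientBoostingClassifier(n_estimators=30, max_depth=5,
                                           subsample=0.8, random_state=1)),
]

# Meta level: an MLP with a single hidden layer.
meta = MLPClassifier(hidden_layer_sizes=(24,), max_iter=1000, random_state=0)

stack = StackingClassifier(
    estimators=first_level,
    final_estimator=meta,
    cv=5,                          # folds used to build the meta-level inputs
    stack_method="predict_proba",  # class probabilities feed the MLP
)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
```

Feeding class probabilities rather than hard labels to the meta-classifier is a design choice of this sketch; it gives the MLP more information about how confident each first-level model is.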


3. IMPLEMENTATION AND RESULTS 

In this work, the Scikit-learn package [29] was used to implement the RF model. The LightGBM
library was used for the LGBM model, and the XGBoost package was used for the XGB model. The
Scikit-learn package was also used to normalize data and to train and evaluate models. Finally, the
stacking model was trained using the Mlxtend package. All code is written in Python.


3.1. Model parameter selection 

The parameter selection for the first-level models is carried out on the training data, using 5-fold CV
and random search. The training data are divided equally into five separate subsets. Each first-level
model (with a given hyperparameter set) is trained five times; each time, four subsets are used for
training and the remaining one is used for evaluation. The average accuracy over the five folds is then
used to choose the best hyperparameter set of each first-level model. The results are shown in
Figure 3. Twenty tree-based classifiers with different parameter sets are tested and evaluated.
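
The selection procedure (random search scored by 5-fold CV accuracy) can be sketched with scikit-learn's `RandomizedSearchCV`. The sketch below is illustrative only: the data are synthetic, the search ranges are assumptions, and only five candidates are drawn to keep it quick, whereas the text reports twenty parameter sets. The parameter names are those of scikit-learn's `RandomForestClassifier`.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(
    n_samples=500, n_features=17, n_informative=8, random_state=0
)

# Distributions to sample hyper-parameter sets from (ranges are assumptions).
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 20),
    "min_samples_split": randint(2, 10),
    "min_samples_leaf": randint(1, 5),
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled hyper-parameter sets
    cv=5,                # 5-fold CV: train on four folds, evaluate on the fifth
    scoring="accuracy",  # mean fold accuracy ranks the candidates
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
```

Here `best_cv_accuracy` is the mean accuracy over the five folds for the best candidate, which corresponds to the quantity used above to pick each first-level model's hyperparameter set.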

For the binary detection task (2 classes: Normal/Attacked), as shown in Figure 3(a), the
LGBM model reaches its best accuracy of 95.08% with the hyper-parameter set {'n_estimators': 104,
'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'auto', 'max_depth': 61, 'bootstrap': False}.
For the RF model, the best accuracy is 96.56%, with the hyper-parameter set {'colsample_bytree':
0.4029038429583326, 'max_depth': 39, 'min_child_samples': 293, 'min_child_weight': 1e-05, 'n_estimators':
865, 'num_leaves': 48, 'scale_pos_weight': 1, 'subsample': 0.20256029767506548}. For the XGB model, the
highest accuracy is 98.51%, achieved with the parameter set {'colsample_bytree': 0.6173735153519851,
'gamma': 1.5, 'max_depth': 94, 'min_child_weight': 5, 'n_estimators': 163, 'subsample': 0.935523447634425}.

For the categorical detection task (8 classes: 1 Normal and 7 attack types), Figure 3(b)
shows that the LGBM model reaches an accuracy of 96.68% with the hyper-parameter set {'n_estimators':
272, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'auto', 'max_depth': None, 'bootstrap':
True}. The best accuracy of the RF model is 97.99%, achieved with the hyper-parameter set
{'colsample_bytree': 0.9182559913420497, 'min_child_samples': 336, 'min_child_weight': 0.01,
'n_estimators': 824, 'num_leaves': 47, 'reg_alpha': 0.1, 'reg_lambda': 0, 'scale_pos_weight': 2, 'subsample':
0.46476797298571926}. For the XGB model, the highest accuracy is 99.21%, achieved with the parameter set
{'colsample_bytree': 0.9048840146153558, 'gamma': 0.5, 'max_depth': 38, 'min_child_weight': 10,
'n_estimators': 242, 'subsample': 0.600878069464399}.

The models with the highest accuracy were chosen, trained, and evaluated on the full
training/testing dataset. The final performance of each model is given in Table 4 and Table 5. All
predictions of these models are used to optimize the hyper-parameters of the MLP meta-classifier. For the
MLP, we fix the number of hidden layers at one. The number of neurons in the hidden layer is then optimized
using a random search strategy. The optimization results are shown in Figure 4. For the binary detection
task, the stacking model reaches an accuracy of 99.72% with 24 neurons in the hidden layer of the MLP
meta-classifier, as shown in Figure 4(a). For the categorical detection task, the accuracy of the stacking
model reaches 99.62% with 97 neurons in the hidden layer of the MLP, as shown in Figure 4(b).

Figure 3. Performance of first-level models by random search CV (a) for 2 classes and (b) for 8 classes

Figure 4. Performance of meta-classifier on the test set by the random search (a) for 2 classes and (b) for 8 classes

Table 4. Performance of each classifier on test set for the binary detection task
Classification report, 2 classes (attack + normal); scores in %

              LGBM                          RF                            XGB
              Precision Recall    f1-score  Precision Recall    f1-score  Precision Recall    f1-score  Support
Normal (0)    98.76037  95.24922  96.97302  99.46873  98.90639  99.18677  99.93243  99.46657  99.69895  43117
Attack (1)    82.398    94.89835  88.20751  96.06994  98.06221  97.05585  98.08493  99.75442  98.91263  11809
Weighted avg  95.1725   95.18261  95.05638  98.72181  98.72556  98.72082  99.5295   99.5285   99.527    54926
Accuracy      95.18261                      98.72556                      99.52846

Table 5. Performance of each classifier on test set for 8-class detection task
Classification report, 8 classes (7 attack types + normal); scores in %

              LGBM                          RF                            XGB
              Precision Recall    f1-score  Precision Recall    f1-score  Precision Recall    f1-score  Support
Normal (0)    98.80697  97.75462  98.27798  99.53863  98.59669  99.06542  99.93709  99.44814  99.69201  43127
NMRI (1)      76.66022  82.17001  79.31955  86.8472   93.02486  89.82994  95.68021  97.63158  96.64604  1520
CMRI (2)      80.13042  85.89638  82.91328  88.4158   94.27403  91.25099  96.39432  98.58768  97.47867  2549
MSCI (3)      91.64557  96.92102  94.2095   98.29114  99.16986  98.72854  97.8481   99.93536  98.88072  1547
MPCI (4)      99.43655  100       99.71748  97.15826  99.89924  98.50969  98.67712  99.97518  99.32191  4029
MFCI (5)      100       98.89001  99.44191  100       98.79032  99.39148  100       99.08999  99.54292  989
DoS (6)       79.77011  93.531    86.10422  94.25287  98.79518  96.47059  93.7931   99.7555   96.68246  409
Recon (7)     97.29032  100       98.62655  97.16129  98.56021  97.85575  97.54839  100       98.75898  756
Weighted avg  97.12042  96.98503  97.03749  98.43663  98.37236  98.39378  99.42612  99.41376  99.41707  54926
Accuracy      96.98503                      98.37236                      99.41376


3.2. Prediction results


All classification metrics (precision, recall, f1-score) of the three individual models (LGBM, RF, XGB)
for the 2-class and 8-class detection tasks are shown in Table 4 and Table 5, respectively. In both tables,
XGB is the best of the three individual classifiers, and LGBM is the worst. The f1-score of the
XGB model reaches 99.53% for the binary detection task and 99.42% for the categorical detection task.


Table 6 and Table 7 show the prediction results of our proposed stacking models. Both stacking models
are more accurate than the three individual classifiers. For the binary task, the accuracy and f1-score of
the stacking model are both 99.72%. For the categorical task, the accuracy and f1-score are 99.62% and
99.63%, respectively. Moreover, Figure 5 shows the detailed quality of the two stacking models through
confusion matrices. The detection rates (recall) of attacked and normal states are 99.32% or higher, with
an overall accuracy of 99.72% in Figure 5(a). For the categorical task, the detection rates of almost all
attack types are higher than 97.55% (except for DoS, 95.17%), with an overall accuracy of 99.62% in
Figure 5(b).


Table 6. Performance of the stacking model on test set for 2-class detection task

2 classes     Precision  Recall    f1-score  Support
Normal (0)    99.80898   99.83456  99.82177  42916
Attack (1)    99.40828   99.31724  99.36274  12010
Weighted avg  99.72136   99.72144  99.7214   54926
Accuracy      99.72144
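
The "Weighted avg" row of Table 6 can be reproduced from the per-class rows. The sketch below assumes that the weights are the class supports, which is the usual convention for classification reports; the function name `weighted_average` is hypothetical.

```python
def weighted_average(values, supports):
    """Average of per-class scores, weighted by the number of instances per class."""
    total = sum(supports)
    return sum(v * s for v, s in zip(values, supports)) / total


# Per-class scores (in %) and supports taken from Table 6: Normal (0), Attack (1).
supports = [42916, 12010]
weighted_precision = weighted_average([99.80898, 99.40828], supports)
weighted_recall = weighted_average([99.83456, 99.31724], supports)
weighted_f1 = weighted_average([99.82177, 99.36274], supports)
```

Under this convention the support-weighted recall equals the overall accuracy, since both count the correctly classified instances divided by the total number of instances. This is consistent with the identical values in the last two rows of the table.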


Table 7. Performance of the stacking model on test set for 8-class detection task

8 classes     Precision  Recall    f1-score  Support
Normal (0)    99.86019   99.75095  99.80554  42963
NMRI (1)      97.54997   97.48711  97.51853  1552
CMRI (2)      98.04373   98.57308  98.30769  2593
MSCI (3)      99.36709   99.43002  99.39854  1579
MPCI (4)      99.75502   99.85287  99.80392  4078
MFCI (5)      100        99.39148  99.69481  986
DoS (6)       95.17241   99.51923  97.2973   416
Recon (7)     97.93548   100       98.95698  759
Weighted avg  99.62759   99.62495  99.62568  54926
Accuracy      99.62495

Figure 5. Detailed quality through confusion matrices (a) confusion matrix of binary classification task and (b)
categorical classification task

A comparison of the prediction results of our proposed model with other recent works that were
evaluated on the same gas pipeline dataset is given in Table 8. As seen in Table 8, most works use a
single classifier, except the ensemble of SAE models in [22], which achieved an accuracy of 95.86% and
an f1-score of 93.83%. Our proposed model shows a relatively high prediction rate for both the binary and
categorical tasks compared with the others.


Table 8. Comparative results of our proposed model with other recent works

Methods                         Binary task                     Categorical task
                                Accuracy (%)     f1-score (%)   Accuracy (%)  f1-score (%)
Bagged tree [20]                98.2
LSTM [23]                       92               85             -             -
K-means, NB, PCA-SVD, GMM [24]  83.19 (K-means)  86.05 (NB)     -             -
Ensemble of SAE [22]            95.86            93.83          -             -
GoogLeNet-LSTM [21]             97.56            -              -             -
RF [3]                          99.58            99.58          99.41         99.41
RandomTree [25]                 97.8             -              -             -
Our ensemble model              99.72            99.72          99.62         99.62


4. CONCLUSION 

In this work, we have proposed a stacking model to improve the quality of intrusion
detection in SCADA systems. The first layer of the proposed model combines random forest,
light gradient boosting machine, and eXtreme gradient boosting models. An MLP network is used as the
meta-classifier of the model. The proposed model is optimized and tested on a public dataset (the gas
pipeline dataset). The test results are promising: the detection accuracy is 99.72% and 99.62% for the
binary and categorical detection tasks, respectively. In future work, all of the binary, categorical, and
specific results in the gas pipeline dataset will be considered. A variant of the proposed stacking model
will be developed and tested to deal with this problem.